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Abstract 

The SNPs (Single Nucleotide Polymorphisms) genotyping platforms are of great 
value for gene mapping of complex diseases. Nowadays, the high-density of these 
molecular markers enables studies of dependence patterns between loci over the genome, 
allowing a simultaneous inference of dependence structure and disease association. In 
this paper we propose a method based on the theory of variable range Markov random 
fields to estimate the extent of dependence among SNPs allowing variable windows 
along the genome. The advantage of this method is that it allows the simultaneous 
prediction of dependence and independence regions among SNPs, without restricting 
a priori the range of dependence. We introduce an estimator based on the idea of pe- 
nalized maximum likelihood to find the conditional dependence neighborhood of each 
SNP in the sample and we prove its consistency. We apply our method to autosomal 
SNPs genotypic data with unknown phase in the context of case-control association 
studies. By examining rheumatoid arthritis data from the Genetic Analysis Workshop 
16 (GAW16), we show the utility of the Markov model under variable range depen- 
dence. 
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1 Introduction 



Genome-wide genetic mapping studies based on linkage disequilibrium (LD) have been 
encouraged from the availability of high-density SNPs (Single Nucleotide Polymorphisms) 
genotyping platforms. Since the individual SNP effect is expected to be small, a challenge 
has been to find blocks of SNPs with effect on diseases. This problem requires the consider- 
ation of a model of the dependence structure among loci over the genome, see for example 
Akey et al.| ( |200T| ); [Greenspan and Gelg^ ( |2006) ; |Browning| ( |2006) ; [Kirr^ 



In this paper we propose a method based on the theory of variable range Markov random 
fields to estimate the extent of dependence among SNPs allowing variable windows along 
the genome. In order to infer the range of dependence for each SNP, we propose a criterion 
based on penalized maximum likelihood. The main theoretical contribution of this paper is 
the proof of the consistency of this estimator, as the sample size diverges. 

As a consequence of our model, we show how the dependence information inferred can 
be used to construct independent blocks of SNPs that can be associated with a response vari- 
able in the context of case-control studies. Tools such as the Chi-square statistic are adopted 
to study the association of each block and the response variable. The main advantage of 
this method is that it allows, without restriction of range, the simultaneous prediction of 
dependence and independence regions among SNPs. 

Methods based on penalized maximum likelihood to infer the range of dependence of 



Markov random fields have been proposed in the literature before, see for example Csiszar 



and Talata (2006a); Locherbach and Orlandi (2011). The main difference between these 



approaches and our is that these methods assume a fixed (and symmetric) neighborhood 
for each variable, because they based the inference procedure on only one sample of the 
process. In our context, we have at hand several independent realizations of the process, 
corresponding to different individuals in the sample, and this allows us to assume an inho- 
mogeneous process with different neighborhoods for each variable. 

In order to illustrate the potential of our method to infer variable dependence struc- 
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tures and disease association we will consider rheumatoid arthritis data from the GAW16. 
Although our methodology is driven to SNP genotypic sequence, extension for allele se- 
quences is also possible. 

The paper is organized as follows. In Section [2] we define the variable range Markov 
random field, introduce the criterion to estimate the range of dependence for each variable 
and state the consistency result. In Section [3] we show how to apply our method to infer the 
dependence structure in SNPs maps and to study association with a response variable in a 
case-control study. Finally, in Section [4] we prove the main theoretical result of this paper. 

2 Variable range Markov random fields 

Let A denote a finite alphabet and let P be a probability distribution over A z , equiped with 
the usual a-algebra generated by the cylinder sets. Given an element i 6 Z, called a site, 
we will denote by X,- L the marginal random variable obtained by the canonical projection of 
elements in A z . Given two non-negative integers / and r, we will denote by A \ r the union 
of the two integer intervals to the left and to the right of site i of length I and r, respectively; 
that is A\ r = {i — I, . . . , % — 1, i+ 1, i + r}. We will also denote by X A i the random vector 
{Xj : j E A\ r } and similarly x A * will denote an element of A A ^ r . 

We say P is a variable range Markov random field on A if for any i E Z there exist two 
integers // and r, such that for all L >U and all R > we have 

F{Xi = x % | X k = x k , k G A% R ) = F{Xj = Xj | X k = x k , k e A}. >ri ) , (1) 

for all x; E A and all x A i for which P(X A ; = x A * ) > 0. 

^L,R y L,R L,R 

Observe that the size of the neighborhood A\. r . may depend on the specific site i, for 
that reason we call our model a variable range Markov random field. 

Remark 2.1. If r*) satisfies equation ([T]) then any pair (/, r) with / > ij and r > will 
also satisfy equation ([!]). For this reason in the sequel we assume U and r, are the minimal 
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integers satisfying (|TJ), calling the set A l t . r , the basic neighborhood of site i. 

Remark 2.2. Note that if Xj and Xj are not conditionally independent given all the re- 
maining variables then C A? D Af , where [z;j] denotes the integer interval 
{i, i + 1, . . . , j — 1, j}. On the other hand, if there exists i such that r 4 < £ — z for any z < £ 
and /j < j — £ for any j > £, then Xi is independent of X,- for any i < I and any j > £. 

In what follows we will focus on the problem of identifying the basic neighborhood 
of a given site i 6 Z. Without loss of generality we will take i = and we will sim- 
ply write A° o ro = A; o ro . We will assume we have an independent sample of size n of 
(X_ io , . . . , X , . . . , X Ro ), with L > l and R > r . We will denote by xy the value 
taken by the j-th variable in the i-th observation. Our goal is to estimate the basic neigh- 
borhood A; o ro (by estimating the parameters / and r ) and the conditional probabilities 
given by based on this sample. 

Given two sequences w = (w-i, . . . , W-\) G A 1 and v = {yi, . . . , v r ) G A r and a 
symbol a G A we will denote by p(a\w,v) and p(w, a, v ) the conditional (respectively 
joint) probability given by 

p(a\w,v) = P(X = a | = w,Xi :r = v) 

and 

p(w, a, v) = F(X = a, = w, X l:r = v) , 

where X^ represents the sequence Xj, . . . , Xj. The operator N n (w, a, v) will denote the 
number of occurrences of the event 

{x^ : _x = w} n {x = «} n {X 1:r } 

in the sample. That is 

n 

N n (w,a } v) = 1{^_L = wat;}, 
i=i 
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where wav is the concatenation of w, a and v; that is wav = (w~i, . . . , io_i, a, t> i, . . . , v r ) E 
A l+r+l . Given w and v, the maximum likelihood estimator of the conditional distribution 
{p(-\w, v ) : a E A} is given by 

w i n N n (w,a,v) 

Pn{a\w,v) = —— — , for a E A, (2) 

N n (w,v) 

where N n (w, v) = J2 a( z A N n (w,a,v). If N n (w,v) = Owe adopt the convention p n (a\w,v) = 
1/\A\ for all a E A. 

For any pair of integers (/, r), with / < L and r < R we denote by 

hr(4'- n v^) = n n n^^i^^)^^ • ^ 

In order to estimate the neighborhood A/ o ro ; that is, to estimate / and r , we propose 
to use a penalized maximum (conditional) likelihood criterion. 

Definition 2.3. Given a constant c > 0, the empirical neighborhood of site is the set of 
indices A ; - - ={—/„,..., — 1, +1, r n }, where 



(t n ,r n ) = argmax { logPi ir (xJ 1:n) |a;^ :n) ) - c \A\ l+r log, A , n } . (4) 

0<Z<L ,0<r<i? 



We prove the following consistency result for the neighborhood estimator. 

Theorem 2.4. The estimator given by (4) satisfies (Z n , r n ) = (7 , 7"o) and therefore A? 
AP eventually almost surely as n — > oo. 



The proof of Theorem 2.4 is given in Section |4j 
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3 Variable dependence windows for SNPs maps and dis- 
ease association 



In this paper we model data from 2,062 individuals (72.4% of them females) in which 868 
are affected by rheumatoid arthritis (cases) and 1,194 are not affected (controls). This data 
is the initial batch of whole genome association data for the North American Rheumatoid 
Arthritis Consortium (NARAC) provided by the Genetic Analysis Workshop 16 (GAW16). 
For all individuals in this case-control study, information from 545,080 SNP-genotype from 
the Illumina 550K chip are available. The genotypes are in the format X_X, where X is 
a base (A,C,G,T). Each record has information about SNP name, chromosome, and SNP 
position in basepairs. Only genotypes from the 22 autosomal chromosomes were used in 
our analysis. For each SNP the scores 0, 1 and 2 were assigned for their three possible SNP 
genotypes. Thus, for instance, a SNP in which their genotypes are GG, AA, and GA, score 
was assigned for the homozygote with highest frequency, score 1 for the heterozygote, 
and score 2 for the homozygote with lowest frequency. 

In order to remove potential genotype errors, we excluded those SNPs with minor allele 
frequency (MAF) lower than 1% as well as those not in Hardy -Weinberg equilibrium (HWE). 
The HWE was checked by using the Chi-square test available in the genetics library of the R 



package (R Development Core Team 2005). The significance level considered was 10~ 4 . 
At the end, a total of 43,616 SNPs were removed with 501,464 remaining for the analysis. 
No procedure was used to impute the missing genotypes that remained in the data set. 

We apply the model and estimators described in Section [2] to this data set. We obtained 
in this way neighborhoods for each one of the 501, 464 SNPs. Considering the entire data 
set we obtained neighborhoods with mean 2.22418 SNPs and standard deviation 0.714481 
SNP. The mean size of the left part of the neighborhoods was 1.11432 and that of the right 
part was 1.10985 SNP. 

An interesting property of the estimated neighborhoods can be observed in Figure [TJ 
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Figure 1: Left and right neighborhoods for the first 19 SNPs in the sample. 



Block i Block i+1 



Figure 2: Independent blocks obtained by the identification of intermediate points not in- 
cluded in any neighborhood. 

where we represent the length of the left an right parts of each neighborhood by an arrow. 
In this representation we can see some regions (between two adjacent SNPs) that are not 
crossed by any arrow; that is by the neighborhoods of the adjacent SNPs. This points 
divide the set of SNPs into (probabilistically) independent blocks of different sizes (see 
Remark 2.2| ). We illustrate how these blocks are obtained in Figure [2] From now on we will 



called these independent blocks of "influence windows". 

Analyzing the neigborhoods previously determined by the algorithm we obtained a to- 
tal of 48, 697 influence windows. The mean size of these influence windows was 10.27 
SNPs, the smallest window has only 1 SNP and the biggest one has 83 SNPs. The standard 
deviation of the sizes of the influence windows was 5.94 SNPs. 

In order to test the association of every influence window with the rheumatoid arthritis, 
we perform a Chi-square test of independence between the observed genotype frequencies 
in that window and the response variable indicating the presence/absence of the disease. 
In Figure [3] we show the scores, corresponding to minus the logarithm in base 10 of the 
p- value, calculated for each window in the 22 autosomes. We can observe a region of 
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Chi-square statistics for case and control SNP windows 
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Figure 3: Chi-square statistics for each influence window in the 22 autosomes. 



high association in the sixth chromosome, where 22 windows had a p-value smaller that 
1CT 16 . These results are compatible with previous studies about rheumatoid arthritis, as for 
example Irigoyen et al. ( 2005[ ); Amos et al. (2009). We can observe also some windows in 
other chromosomes that exhibited a small p-value and can therefore be associated with the 
disease. A list with the influence windows that had a p-value smaller that 10~ 4 , as well as 
the program written in C to perform this analysis is available and can be requested from the 
authors. 



4 Proof of Theorem 2.4 



We begin by giving some basic definitions and stating some results that will be useful in the 



proof of Theorem 2.4 From now on we simply write log for the logarithm in base \A\. 



Definition 4.1. The Kullback-Leibler divergence between the two probability distributions 
P and Q over A is defined by 



D(P;Q) = P(a) log 



P(o) 
Q(a) 



where, by convention, P(a) log 
Q{a) = 0. 



if P(a) 



and P{a) log = +oo if P(a) > 
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The following lemma was taken from |Csiszar and Talata] (|2006b[ Lemma 6.3). We 



include it here for completeness, but we omit its proof. 
Lemma 4.2. For any two probability distributions P and Q over A we have 

[P(a) - Q{a)f 



D(P;Q) < £ 



Q(a) 

aeA: Q(a)>0 ^ V 1 

Now we prove a result showing an upper bound for the deviation of the empirical con- 
ditional probabilities from their true values. 

Proposition 4.3. For any 5 > and for any triple (w, a, v) e A l+r+1 with < / < L and 
< r < R we have 



\p n (a\w, v) — p(a\w, v) I < 



5 \ogn 

N n (w,v) 



eventually almost surely as n — > oo. 

Proof. Define, for fixed (w, a, v) E A l+r+1 , the random variables 

Yi = l{x^ lr = wav} — p(a\w, v = w}l{x'f} r = v } , i = l,2,... 



n 



and 

n 

Z n = = N n (w,a,v) - p(a\w,v)N n (w,v) . (5) 

i=i 

The variables {Yi : i = 1,2, ... ,n} are i.i.d and a direct calculation gives, for any i = 
l,...,n,E(Yi) = Oand 



E(F i 2 ) = p(a\w, v )(1 — p(a\w, v ))p(w, v ) < 



where p(w, v) = J2 a > £A p(w, a', v). Now, by the Law of the Iterated Logarithm we have 
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that for any e > 

\Z n \ < (l + e)^V2nloglog 
eventually almost surely as n — > oo. In particular we have 



n 



\Z n \ < y/2p(w, v) 2 n\og log 
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eventually almost surely as n — > oo. Dividing both sides of the inequality by N n (w, v) we 
obtain that 

\p n (a\w,v) — p(a\w,v)\ < 



2p(w, v ) 2 n log log n 



N n (w,v) 2 

By the Strong Law of Large Numbers we have that n/N n (w, v) — > 1 /p(w, v) almost surely, 
therefore we have 



\p n (a\w,v) — p(a\w,v)\ < 



<Ap(w, v) log logn 



N n (w,v) 

eventually almost surely as n — > oo. Now, for any 5 > we have that 

4p(w, v) log log n < 5 log n 
eventually as n — > oo, and this concludes the proof of Proposition |4.3| . □ 



Proof of Theorem 2. 4 Denote by 



PMLj i? . (a^o I x A™r ) = logP^(4 ^kt^-cl^i^logn. 

We will divide the proof in two cases. 
(a) Overestimation. We have to prove that simultaneously for all pairs (I, r) ^ (l , r ), with 
< I < L , r < r < R we will have 
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eventually almost surely as n — > oo. 
Observe that 



PML^Cx^lx^) - PML Iir (x^|x^) 



c(\A\ l+r - \A\ lo+r °)\ogn - V N(wav) log - t f^ a \ w ^ ( 6 ) 

TTt+r+i p{a\w-i .-uv 1 . ro ) 

wav£A l+r+1 

where by an abuse of notation we write N(wav) = N n (w, a, v) andp(a\w, v) = p n (a\w, v). 
As these empirical probabilities are the maximum likelihood estimators we have that 

^ N(wav)logp(a\w-i o: -i,v 1:ro ) > ^ N(wav)p(a\w-i ol -i,v liro ) 

waveA l + r + 1 wav£A l + r + 1 

= N(wav )p(a\w, v) . 

wav£A l + r + 1 

Therefore, ([6]) can be upper-bounded by 

c (l_ _L)|A|^logn - N(wav)log mW,v) 



p(a\w, v) 



Now observe that 



EN(wav) log — - — - — r= y N(w-v)D(p(-\w,v) ; p(-\w,v)) 
n n 7/1 7)1 ^— ' 



1 p(a\w, v, 



where .D denotes the Kullback-Leibler divergence (see Definition 4.1 in the the Appendix) 
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Therefore, by Lemma 4.2 and Proposition 4.3 we have that for any 5 > 



N(w-v)D(p(-\w, v ) ; p(-\w, v)) 



u-v£A 1 ^ 



< N{wv)J2 

w-v£A l + r a£A 
w-v£A l + r a£A 

< 5\A\ l+r+1 \ogn 



[p(a 


w, v) - 


-p(a 


w, v) } 2 


p(a 
5 log n 


w, v) 




N(w 


■v)p(a 


w, v) 



Pmin 

eventually almost surely as n — y oo, where 



p m - m = min{p(a\w, v) : p(a\w, v) > 0, a £ A,w £ A l °, v G A r ° } . 
Then if we take 5 < cp m j n (| A\ — 1)/\A\ 2 we have that eventually almost surely as n — y oo 

simultaneously for all pairs (l,r) ^ (l ,r ), with l < I < L , r < r < R . This 
completes the proof of part (a). 

(b) Underestimation. We have to prove that simultaneously for all pairs (/, r) with I < / or 
r < r we have 

eventually almost surely as n — y oo. 
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First consider the case I < l and r < r . In this case we have that 



(l:n) U (l:n) s _ p ML , (l:n) 

0' r ' *-*l,r 

p(a\w, v 

iv riuu'f I log — — 
win 

u)a«eA i 0+ r 0+ 1 



V iVfwav) log „ _ c (\A\ lo+ro - logn 

' p(a|w_ /: _i,fi :r ) 



n 



By the Strong Law of Large Numbers we have that 



EN(wav) j p(a|iy, u) 



n p(a|w_f._i, Vi- r ) 



yj p{wav) log 



p(a|w, f ) 



p(a|w_i : _i,fi :r ) 



almost surely as n — > oo, where p{wav) = P(X = a, -^A iQ rQ = wt>). The Log-Sum 



inequality and the minimality of (7 , r ) (see Remark 2.1 ) implies that 



E, \ i p(a\wv) 
pywav) log —— r > (J 

As the number of pairs (Z,r) ^ (lo,r ) satisfying I < l and r < r is finite, this implies 
that simultaneously for all such pairs we will have 

eventually almost surely as n — > oo. 

Now consider the case (/, r) with lo < I < L and r < r . We will prove that simulta- 
neously for all l < I < L , 
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eventually almost surely as n — > oo. The proof of this fact follows the same arguments of 
part (a), by observing that 



PML^x^H 1 ;:!.) - PML^x^lxi 1 ;?) > c (l - pi ) \A\ l+r \ogn 



> 



c 1 



p(a|iy, u) 

lOg 

1 \ 



— > N(wav) log 



A\J p* 



A\ l+r logn 



> 



by Proposition 4.3 for a sufficiently small 5, eventually almost surely as n — > oo and 
simultaneously for all l < I < L . This fact, combined with what was proved for the pair 
(7 , r) before implies that 



eventually almost surely as n — > oo, simultaneously for all r < r and l < I < L . 

By observing that the same proof applies to the case (7, r) with / < / and r < r < R , 
this finishes the proof of part (b). 

□ 



Discussion 

In this paper we presented a method based on the theory of variable range Markov random 
fields to estimate the extent of dependence among SNPs allowing variable windows along 
the genome. We proposed an estimator based on the idea of penalized maximum likelihood 
to find the conditional dependence neighborhood of each SNP in the sample and we proved 
its consistency. A major advantage of our method is that it is adaptive for the extent of 
dependencies among SNPs and it is not necessary to specify a window size to capture the 
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dependency pattern which is a problem in most sliding-windows approaches. 

The core of our method is to find the basic neighborhoods of long sequences of SNPs, 
without taking into account the disease status, and then to test the association of each in- 
dependent block with the disease. Therefore, the method can be used for association with 
many different types of trait data, such as quantitative traits. It could also be applied to other 
platform types, like multiallelic markers (e.g. microsatellites), as well as to other data se- 
quences, like nucleotide or aminoacid sequences. In our analysis we considered genotypic 
data sequences, on the level of individuals, but another option is to use haplotype data, on 
the level of chromosomes, by phasing allele data. The challenge in the latter is to estimate 
the phase of the data, but good haplotype-phasing computer programs are now available. 
The flexibility of our method to handle genotype or haplotype data may be useful to assess 
different disease models. 

A reasonably large sample size is required to attain consistency of our neighborhood 
estimator. Rare long SNPs blocks, which may be present in the population, are expected to 
be observed in low frequency and may bias the findings. In this direction, an open question 
is how to obtain lower bounds for the sample size to guarantee a given level of precision 
for the neighborhood estimator. Beyond the sample size, the results are dependent of the 
density of the SNPs covering the genome and also of the size of the alphabet being modeled. 
For biallelic markers, as SNPs, these problems become less severe. 
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